DENSITY ESTIMATION FOR GROUPED DATA WITH
APPLICATION TO LINE TRANSECT SAMPLING
University of Georgia and Columbia University

Line transect sampling is a method used to estimate wildlife pop- 
ulations, with the resulting data often grouped in intervals. Esti- 
mating the density from grouped data can be challenging. In this 
paper we propose a kernel density estimator of wildlife population 
density for such grouped data. Our method uses a combined cross- 
validation and smoothed bootstrap approach to select the optimal 
bandwidth for grouped data. Our simulation study shows that with 
the smoothing parameter selected with this method, the estimated 
density from grouped data matches the true density more closely than 
with other approaches. Using smoothed bootstrap, we also construct 
bias-adjusted confidence intervals for the value of the density at the 
boundary. We apply the proposed method to two grouped data sets, 
one from a wooden stake study where the true density is known, and 
the other from a survey of kangaroos in Australia. 

1. Introduction. In ecology it is often of great interest to study the 
abundance of wildlife populations. A common approach for estimating the 
abundance of a biological population is distance sampling [Barabesi (2000); 
Barabesi, Greco and Naddeo (2002); Chen (1996)], of which line transect 
sampling is an example. A comprehensive review of distance sampling can 
be found in Burnham, Anderson and Laake (1980) and Buckland et al. 
(2001). 

In such studies the detectability of individual data points often varies 
with the distance and selection biases are common. In the basic line tran- 
sect scheme, for example, a number of lines of total length L are randomly 
placed in the region of interest. Observers then move along these lines and 
record the perpendicular distance of each detected animal from the line. 
Animals further away from the lines are more likely to be missed and this 
can be modeled via a detection probability function p(x) that represents the
conditional probability of detecting an animal, given that the animal is at a
perpendicular distance x from the line. Buckland et al. (2001) showed that
the density function of observed distances, denoted f(x), can be obtained
from p(x) by rescaling p(x) to integrate to 1.

In line transect sampling, it is assumed that the line transects are placed 
independently of the animal population so that the animals are distributed 
uniformly in distance from the lines. The decrease in observations with dis- 
tance is then attributed to the detection function p(x). 

Several assumptions about p(x) are also often made. Since animals are
more likely to be missed with increasing distance from the observer, p(x) is
assumed to be monotonically decreasing in x. Furthermore, it is assumed
that p(0) = 1 and p'(0) = 0, where p' is the derivative of p with respect to x,
the former representing the assumption that an animal on the line will not
be missed. By adding the assumption that $\int_0^\infty p(x)\,dx < \infty$, Burnham and
Anderson (1976) showed that the average number of animals per unit area,
D, can be estimated with

$$\hat{D} = \frac{n\hat{f}(0)}{2L}, \tag{1.1}$$

where n is the number of observations, L is the total length of the line
transects and $\hat{f}(0)$ is an estimate of f(0).

The rather unintuitive formula (1.1) can be better understood as follows.
Suppose that a strip of width 2w and total length L is surveyed and n
animals are detected. The animal density is then given by

$$D = \frac{n}{2wLP_a},$$

where $P_a$ is the unconditional detection probability of an animal in the strip
of area 2wL, which can be expressed as

$$P_a = \frac{1}{w}\int_0^w p(x)\,dx.$$

With $f(x) = p(x)/\int_0^w p(x)\,dx$ and p(0) = 1, one can show

$$P_a = \frac{1}{wf(0)},$$

giving (1.1).
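As a quick numerical illustration of (1.1), the sketch below evaluates n f(0)/(2L) for the wooden stake figures quoted in Section 5.1 (n = 68 detections, L = 1000 m, and an estimate of f(0) equal to 0.1033 per metre). The function name and the conversion to hectares are our own choices for illustration, not part of the original analysis.

```python
def density_from_f0(n, f0_per_m, total_length_m):
    """Evaluate D = n * f(0) / (2 L); the result is in objects per square metre."""
    return n * f0_per_m / (2.0 * total_length_m)

# Stake data quoted in Section 5.1: 68 detections along a 1000 m transect.
d_per_m2 = density_from_f0(n=68, f0_per_m=0.1033, total_length_m=1000.0)
d_per_hectare = d_per_m2 * 10_000.0  # 1 hectare = 10,000 square metres
print(round(d_per_hectare, 2))
```

The result is about 35.1 stakes per hectare, which agrees with the value reported in Section 5.1 up to rounding of the f(0) estimate.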

Due to the difficulty in measuring distances, the observations are often 
grouped into convenient distance markers, such as multiples of five or ten. 
Thus, estimation of animal populations using line transect sampling involves 
estimating a density function f from grouped data. In particular, the value
of the density at the boundary, specifically, at x = 0, is of interest.

Various estimation techniques have been proposed for use with line transect
data. Buckland et al. (2001) introduced parametric modeling of f, of
which Fourier series estimators [Burnham, Anderson and Laake (1980)] form 
a subclass. Other methods include kernel density estimation [Chen (1996); 
Mack and Quang (1998)] and semiparametric methods [Barabesi (2000); 
Barabesi, Greco and Naddeo (2002)]. The reader is asked to refer to the 
cited works for details on these methods. 

Parametric methods work well if the model is correct. Also, in smaller 
data sets, the data may be grouped into as few as 3 or 4 groups. In these 
cases, parametric models using covariate information will be useful [Mar- 
ques and Buckland (1992)]. Here, we focus on nonparametric methods, in 
particular, on kernel density estimation using grouped data. In the context 
of line transect sampling, the aim will be the estimation of f(0). However, 
our proposed method for bandwidth selection in density estimation from 
grouped data has applications beyond line transect sampling (see Section 6.2).



Nonparametric estimation of f(0) has a number of challenges in the
grouped data setting: 

1. Density estimation from grouped data: When data are grouped, using 
risk function estimators such as the cross-validation score function to 
choose the optimal smoothing parameter can be problematic since the 
risk function estimators tend to be monotone decreasing functions of 
the smoothing parameter. As a result, using cross-validation for optimal 
smoothing parameter selection may lead to undersmoothing. 

2. Density estimation at the boundary: Since distances are nonnegative, the 
support of the density should not include any negative values. To satisfy 
this condition, one must modify the original kernel density estimator to 
remove any boundary bias. 

3. Obtaining confidence intervals: The standardized form of the nonparametric
estimator $\hat{f}$ can be expressed as the sum of two terms:

$$\frac{\hat{f}(x) - f(x)}{\sqrt{\mathrm{Var}(\hat{f}(x))}} = \frac{\hat{f}(x) - E(\hat{f}(x))}{\sqrt{\mathrm{Var}(\hat{f}(x))}} + \frac{E(\hat{f}(x)) - f(x)}{\sqrt{\mathrm{Var}(\hat{f}(x))}}.$$

While the first term converges to the standard normal distribution by
the central limit theorem, in nonparametric inference the second term is
not negligible because of the bias-variance trade-off. Common smoothing
techniques require the bias and the standard error to be of the same
order. Therefore, confidence intervals of the traditional form
$\hat{f}(x) \pm z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{f}(x))}$ do not necessarily achieve the nominal level.

Note that the second and third points above are also common issues
in nonparametric inference for ungrouped data. Chen (1996) and
Mack and Quang (1998) used kernel methods to address these two issues
for ungrouped data. Barabesi, Greco and Naddeo (2002) developed a
semiparametric method for grouped data, but used the traditional form of the
confidence interval for f(0). Optimal bandwidth selection plays an important
role in addressing the second and third issues whether data are grouped
or not. In this work, we develop an inference procedure that addresses all
three issues together.

Specifically, we propose a combined cross-validation and smoothed boot- 
strap procedure to select the optimal bandwidth in kernel density estimation 
with grouped data and to construct bias-adjusted confidence intervals for the 
density at the boundary. To adjust for the boundary bias, we employ a sym- 
metrization technique introduced by Buckland (1992). Our methods can be 
easily extended to multivariate cases. We are not aware of any other work 
that addresses all aforementioned issues together. 

The paper is organized as follows. Section 2 provides a brief overview of 
kernel density estimators and includes a description of the symmetrization 
technique for kernel density estimates at the boundary. In Section 3 we in- 
troduce a smoothed bootstrap approach for bandwidth selection for grouped 
data. Section 4 explains our approach for constructing bias-adjusted confidence
intervals for f(0) and the animal population density D. We present
two case studies in Section 5, one using data from a wood stake study and 
the other from a survey of kangaroos in Australia. Section 6 shows the per- 
formance of the proposed method in simulation studies with data generated 
from artificially constructed densities commonly used to test kernel density 
estimators as well as with a simulated line transect data set. Concluding 
remarks follow in Section 7. 

2. Inference for f(0). Suppose that we have a sample $X_1, \ldots, X_n$ from
the density function f(x). The nonparametric kernel density estimator of
f(x) is given by

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

where h is the bandwidth and K is a bounded, symmetric kernel function
integrating to one. In kernel methods, the choice of the bandwidth is more
crucial than the choice of kernel. The bandwidth specifies the amount of
smoothing applied to the data and controls the performance of $\hat{f}_h(x)$. For
grouped data, the choice of h will be addressed in Section 3. We use the 
Gaussian kernel throughout this paper. 

In line transect sampling, the interest is in the value of the density at the
boundary x = 0, f(0), since this quantity is related to the animal population
density. It is well known that kernel estimators suffer from high bias
near boundaries [Wasserman (2005)]. Barabesi (2000) used local likelihood
density estimation to reduce this boundary bias, and Barabesi, Greco and
Naddeo (2002) extended this approach to grouped data. We will instead
employ the symmetrization technique used in Chen (1996), originally suggested
by Buckland (1992). The key idea of the symmetrization technique is
to duplicate the data by reflecting the data about the boundary: we replace
each data value $X_i$ with $X_i$ and its reflection $-X_i$ about 0. Then we assume
that the data consist of values $Y_1, \ldots, Y_{2n}$, where $Y_{2i-1} = X_i$ and $Y_{2i} = -X_i$.
Thus, X = |Y| and we have

$$f(x) = g(x) + g(-x),$$

where g is the density of Y. The kernel estimator of g is

$$\hat{g}_h(y) = \frac{1}{2nh}\sum_{k=1}^{2n} K\left(\frac{y - Y_k}{h}\right) = \frac{1}{2nh}\sum_{i=1}^{n}\left[K\left(\frac{y - X_i}{h}\right) + K\left(\frac{y + X_i}{h}\right)\right],$$

so that we have, as the kernel estimator of f(0),

$$\hat{f}_h(0) = 2\hat{g}_h(0) = \frac{1}{nh}\sum_{k=1}^{2n} K\left(\frac{Y_k}{h}\right) = \frac{2}{nh}\sum_{i=1}^{n} K\left(\frac{X_i}{h}\right),$$

where the last equality is due to the symmetry of the Gaussian kernel about
zero.
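The equivalence between the reflected-data estimate and the shortcut formula can be checked numerically. The following is a minimal sketch with a Gaussian kernel; the distances and the bandwidth are made-up values, and the function names are ours.

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def g_hat(y, reflected, h):
    """Kernel estimate of g at y, built from the 2n reflected values."""
    total = sum(gaussian_kernel((y - yk) / h) for yk in reflected)
    return total / (len(reflected) * h)

def f0_hat(distances, h):
    """Shortcut (2 / (n h)) * sum_i K(X_i / h) for the boundary value f(0)."""
    n = len(distances)
    return 2.0 * sum(gaussian_kernel(x / h) for x in distances) / (n * h)

distances = [0.4, 1.2, 1.3, 2.7, 3.1, 4.8, 6.0, 9.5]   # made-up distances
# Reflection about 0; the order of the 2n values does not matter for the estimate.
reflected = distances + [-x for x in distances]
h = 1.5                                                 # arbitrary bandwidth
via_reflection = 2.0 * g_hat(0.0, reflected, h)
via_shortcut = f0_hat(distances, h)
print(via_reflection, via_shortcut)
```

Both routes give the same number up to floating-point error, so in practice only the original n distances need to be stored.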



3. Bandwidth selector for grouped data. Here, we first describe the 
cross-validation method and the smoothed bootstrap method for bandwidth 
selection with ungrouped data. After highlighting difficulties with using 
these methods with grouped data, we introduce a combined cross-validation 
and smoothed bootstrap strategy that can deal with such grouped data. 

In density estimation, the performance of the density estimate $\hat{f}_h$ is highly
sensitive to the choice of the smoothing parameter h and one often selects
the optimal smoothing parameter from observed data using some criterion
of performance. A common criterion is the risk function, $R(f, \hat{f}_h)$, defined
to be the expectation of a loss function, $L(f, \hat{f}_h)$, often chosen to be the
integrated squared error (ISE):

$$L(f, \hat{f}_h) = \mathrm{ISE} = \int [\hat{f}_h(x) - f(x)]^2\,dx.$$

The risk function can be written as a sum of the squared bias term and the
variance term,

$$R(f, \hat{f}_h) = E(L(f, \hat{f}_h)) = \int \mathrm{bias}^2(\hat{f}_h(x))\,dx + \int \mathrm{Var}(\hat{f}_h(x))\,dx;$$

hence, the optimal smoothing parameter is chosen to balance the tradeoff
between the bias and the variance.

The integrated squared error (ISE) can be written as

$$\mathrm{ISE} = \int [\hat{f}_h(x) - f(x)]^2\,dx = \int \hat{f}_h^2(x)\,dx - 2\int \hat{f}_h(x)f(x)\,dx + \int f^2(x)\,dx. \tag{3.1}$$

Since the last term on the right-hand side of (3.1) is independent of h,
minimizing ISE is equivalent to minimizing the first two terms. As f is not
known, the middle term has to be estimated, usually by cross-validation or
bootstrap.

With cross-validation, this middle term is estimated by

$$-\frac{2}{n}\sum_{i=1}^{n} \hat{f}_{h,-i}(X_i),$$

where

$$\hat{f}_{h,-i}(x) = \frac{1}{(n-1)h}\sum_{j \neq i} K\left(\frac{x - X_j}{h}\right)$$

is the kernel estimate of f with the ith data point removed. Thus, the
cross-validation function is

$$\mathrm{CV}(h) = \int \hat{f}_h^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n} \hat{f}_{h,-i}(X_i),$$

and the value of h that minimizes this function is chosen as the bandwidth.
An asymptotic justification of the cross-validation procedure can be found
in Stone (1984).
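For the Gaussian kernel both pieces of the cross-validation function have closed forms, so CV(h) can be evaluated without numerical integration. The sketch below is one way to do this in plain Python; the data, the search grid and all function names are illustrative assumptions rather than the implementation used for the analyses in this paper.

```python
import math

def normal_pdf(u, s):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def kde(x, data, h):
    """Gaussian kernel density estimate at x."""
    return sum(normal_pdf(x - xi, h) for xi in data) / len(data)

def integrated_square(data, h):
    """Closed form of the integral of kde^2 for the Gaussian kernel."""
    n = len(data)
    s = h * math.sqrt(2.0)  # two N(0, h^2) kernels convolve to N(0, 2 h^2)
    return sum(normal_pdf(xi - xj, s) for xi in data for xj in data) / n ** 2

def leave_one_out_mean(data, h):
    """Average over i of the estimate at X_i computed without X_i."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        others = sum(normal_pdf(xi - xj, h) for j, xj in enumerate(data) if j != i)
        total += others / (n - 1)
    return total / n

def cv_score(data, h):
    """CV(h) = integral of kde^2 minus twice the leave-one-out average."""
    return integrated_square(data, h) - 2.0 * leave_one_out_mean(data, h)

def select_bandwidth(data, grid):
    """Grid value with the smallest cross-validation score."""
    return min(grid, key=lambda h: cv_score(data, h))

data = [0.3, 1.1, 1.9, 2.4, 3.8, 4.1, 5.6, 7.2, 9.0, 12.5]  # made-up, no ties
grid = [0.25 * k for k in range(1, 21)]                      # 0.25, 0.5, ..., 5.0
print(select_bandwidth(data, grid))
```

A grid search is used here only for transparency; any one-dimensional optimizer over h could be substituted.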

The bootstrap is an alternative to cross-validation for bandwidth selection
[Taylor (1989)]. However, the nonparametric bootstrap method of sampling
the data points with replacement and obtaining bootstrap density estimates
from the bootstrap samples cannot capture the bias since the bootstrap
estimates are unbiased. Thus, the smoothed bootstrap is used instead. This
involves obtaining an initial density estimate, $\hat{f}(x; h_{in})$, using a pilot
bandwidth value $h_{in}$ and obtaining smoothed bootstrap samples by drawing
samples from this initial density estimate. The optimal bandwidth is then the
value of h that minimizes

$$\mathrm{BMISE}(h) = E_S \int [\hat{f}^S(x; h) - \hat{f}(x; h_{in})]^2\,dx, \tag{3.2}$$

where $\hat{f}^S$ is the kernel density estimate for the smoothed bootstrap sample
generated from $\hat{f}(x; h_{in})$ and $E_S$ represents the mean over the smoothed
bootstrap sampling distribution. This smoothed bootstrap approach can
often perform better than cross-validation; see Faraway
and Jhun (1990) and Jones, Marron and Sheather (1996) for details.
Faraway and Jhun (1990) recommended choosing the pilot estimate with
cross-validation.

The cross-validation and smoothed bootstrap methods described above
work well with ungrouped data. In practice, however, data are often binned
or rounded to some extent. Suppose we have a mesh $\{t_k\}_{k=0}^{K}$ specifying K
intervals. The actual data $(X_1, \ldots, X_n)$ are not recorded as such, but instead
are of the form $(v_1, n_1), \ldots, (v_K, n_K)$, where $n_k$ is the count in the bin
$B_k = [t_{k-1}, t_k)$ and $v_k$ is typically taken to be the midpoint of the bin $B_k$. Often
the bin size $\delta_k = t_k - t_{k-1}$ is constant for all k, but this is not required in
our proposed method.

It is well known that using the cross-validation function to select the 
smoothing parameter leads to undersmoothing if the proportion of tied data 
is larger than some threshold [Silverman (1986)]. Indeed, any reasonable 
risk function estimate may not work as a criterion for choosing the optimal 
smoothing parameter if there is significant overlapping in the data. 

This can be explained heuristically. Since the risk function can be written 
as a sum of a squared bias term and a variance term, selecting the band- 
width that minimizes the risk function selects the amount of smoothing that 
balances the bias and the variance. When data are grouped, however, there 
are additional biases so that the squared bias term dominates the variance 
term in the risk function. As a smaller bandwidth reduces the bias, using the 
risk function produces undersmoothing. In density estimation, this means 
that the selected optimal bandwidth will be close to 0. 
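For the Gaussian kernel the small-bandwidth behaviour can be made explicit. The following is a sketch of the standard argument for tied data rather than a derivation taken from this paper. Write $\phi_s$ for the N(0, s^2) density and suppose the n observations sit at K distinct values $v_k$ with counts $n_k$, without reflection.

```latex
\begin{align*}
\int \hat{f}_h^2(x)\,dx &= \frac{1}{n^2}\sum_{i,j}\phi_{\sqrt{2}h}(X_i - X_j), \\
\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{h,-i}(X_i) &= \frac{1}{n(n-1)}\sum_{i \neq j}\phi_{h}(X_i - X_j).
\end{align*}
% As h -> 0 only tied pairs contribute, with
% \phi_{\sqrt{2}h}(0) = 1/(2h\sqrt{\pi}) and \phi_h(0) = 1/(h\sqrt{2\pi}), so
\[
\mathrm{CV}(h) \approx \frac{1}{h}\left[\frac{\sum_k n_k^2}{2\sqrt{\pi}\,n^2}
  - \frac{2\sum_k n_k(n_k - 1)}{\sqrt{2\pi}\,n(n-1)}\right].
\]
```

When ties are heavy the bracket is negative and CV(h) decreases without bound as h shrinks. For the ten stake counts in Section 5.1 (n = 68, with the squared counts summing to 524), the bracket is roughly 0.032 - 0.080 = -0.048, so the minimizer sits at the lower end of any search range.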

To address this problem, we propose a new bandwidth selection proce- 
dure for kernel density estimation using a combined cross-validation and 
smoothed bootstrap strategy. To estimate f(0) in line transect sampling,
we use this new method for bandwidth selection (Steps 2-6 below) together 
with the symmetrization technique of Section 2 (Step 1). 

Suppose we have grouped data with a number of counts within each bin. 
The estimation procedure involves the following steps: 

1. Apply the symmetrization technique to $X_i$, i = 1, ..., n, to obtain
$Y = (Y_1, \ldots, Y_{2n})$. Note that X is binned, so many of the $X_i$s, and thus the
$Y_i$s, overlap. This symmetrization step is performed to reduce the boundary
bias in the estimation of f(0). The remaining steps form the smoothed
bootstrap procedure for bandwidth selection for grouped data.

2. For each bin k = 1, ..., K, generate noise from the uniform $[-\delta_k/2, \delta_k/2]$
distribution and add it to the data points, so that the data points no
longer overlap. Let $Y^u = (Y^u_1, \ldots, Y^u_{2n})$ denote the new data.

3. Use cross-validation to calculate the optimal bandwidth for $Y^u$.

4. Repeat Steps 2 and 3 1000 times and let the average of the optimal
bandwidths be $h_{in}$. An initial density estimate $\hat{g}(y; h_{in})$ is then obtained
using $h_{in}$ as the pilot bandwidth.

5. Generate B smoothed bootstrap samples $Y^S = (Y^S_1, \ldots, Y^S_{2n})$ from $\hat{g}(y; h_{in})$.

6. With the smoothed bootstrap samples, evaluate BMISE as a function of
h and find the value of h that minimizes BMISE(h), denoted $h_S$.

7. Compute $\hat{f}(0) = 2\hat{g}(0; h_S)$.

In short, we are using a smoothed bootstrap approach with the pilot
bandwidth $h_{in}$ found using cross-validation on grouped data with random
noise added to them. Note that the smoothed bootstrap in Step 5 above
produces bootstrap samples $Y^S$ from $\hat{g}(y; h_{in})$, and the optimal bandwidth
$h_S$ is chosen based on $Y^S$, not $Y^u$. Thus, the choice of $h_S$ is not directly
affected by the dependence created in the symmetrization step.
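Steps 1 to 4 can be sketched as follows for the stake data of Section 5.1. This is a minimal illustration under several assumptions of ours: the first bin is taken to start at 0, each of the 2n reflected points receives independent noise, the bandwidth is searched over a small grid, and only 5 repetitions are used instead of 1000 to keep the run time short. All function names are hypothetical.

```python
import math
import random

def normal_pdf(u, s):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cv_score(data, h):
    """Least-squares cross-validation score for a Gaussian kernel."""
    n = len(data)
    s2 = h * math.sqrt(2.0)
    sq = sum(normal_pdf(a - b, s2) for a in data for b in data) / n ** 2
    loo = sum(normal_pdf(a - b, h)
              for i, a in enumerate(data)
              for j, b in enumerate(data) if i != j) / (n * (n - 1))
    return sq - 2.0 * loo

def expand_bins(edges, counts):
    """One (midpoint, width) pair per observation in each bin."""
    points = []
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        points += [((lo + hi) / 2.0, hi - lo)] * c
    return points

def symmetrize(points):
    """Step 1: add the reflection about 0 of every observation."""
    return points + [(-v, w) for v, w in points]

def jitter(points, rng):
    """Step 2: spread tied values uniformly over their own bin (assumed independent)."""
    return [v + rng.uniform(-w / 2.0, w / 2.0) for v, w in points]

def pilot_bandwidth(edges, counts, grid, repeats, seed=0):
    """Steps 1-4: average the CV-optimal bandwidth over jittered copies."""
    rng = random.Random(seed)
    y = symmetrize(expand_bins(edges, counts))
    chosen = []
    for _ in range(repeats):
        y_u = jitter(y, rng)
        chosen.append(min(grid, key=lambda h: cv_score(y_u, h)))  # Step 3
    return sum(chosen) / repeats

edges = [0, 1, 2, 3, 4, 5, 7, 9, 11, 15, 20]   # stake data mesh; 0 is assumed
counts = [8, 6, 4, 13, 7, 8, 7, 6, 5, 4]
grid = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]           # illustrative search grid
h_in = pilot_bandwidth(edges, counts, grid, repeats=5)
print(h_in)
```

Because the jitter removes the ties, the cross-validation score in Step 3 is no longer driven to the smallest bandwidth, and averaging over repetitions reduces the dependence on any single draw of the noise.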

Remark 1. Smoothed bootstrap samples can be generated from $\hat{g}(y; h_{in})$
by rejection sampling, but can be generated more simply as follows:

1. Use the naive bootstrap to resample $Y^*_1, \ldots, Y^*_{2n}$ from $Y^u_1, \ldots, Y^u_{2n}$.

2. Generate $Z_i$ from K(·). In our case, since we are using the Gaussian kernel,
we generate $Z_i$ from the standard normal.

3. Set $Y^S_i = Y^*_i + h_{in} Z_i$ for i = 1, ..., 2n.
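The resampling recipe of Remark 1 and Steps 5 to 7 can be combined in a short sketch. With Gaussian kernels every integral in (3.2) has a closed form, so BMISE(h) is approximated here by an average over B resamples. The base sample, the pilot bandwidth, B = 20 and the grid are made-up stand-ins, and the choice to build the pilot estimate on the same points that are resampled is our assumption.

```python
import math
import random

def normal_pdf(u, s):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cross_term(a, sa, b, sb):
    """Closed-form integral of kde(a; sa) * kde(b; sb) for Gaussian kernels."""
    s = math.sqrt(sa * sa + sb * sb)
    return sum(normal_pdf(x - y, s) for x in a for y in b) / (len(a) * len(b))

def smoothed_resample(base, h_in, rng):
    """Remark 1: naive resample plus h_in times standard normal noise."""
    return [rng.choice(base) + h_in * rng.gauss(0.0, 1.0) for _ in base]

def bmise(base, h_in, h, resamples):
    """Monte Carlo approximation of (3.2) over the supplied resamples."""
    pilot_sq = cross_term(base, h_in, base, h_in)
    total = 0.0
    for ys in resamples:
        total += cross_term(ys, h, ys, h) - 2.0 * cross_term(ys, h, base, h_in) + pilot_sq
    return total / len(resamples)

def f0_hat(base, h):
    """Step 7: 2 * g_hat(0; h), with g_hat built on the reflected sample."""
    return 2.0 * sum(normal_pdf(y, h) for y in base) / len(base)

rng = random.Random(1)
x = [0.2, 0.7, 1.1, 1.8, 2.5, 2.9, 3.6, 4.4, 5.9, 7.3, 8.8, 12.0]  # made-up
base = x + [-v for v in x]      # stands in for the jittered reflected data
h_in = 1.5                      # stands in for the pilot bandwidth of Step 4
resamples = [smoothed_resample(base, h_in, rng) for _ in range(20)]  # Step 5
grid = [0.5, 1.0, 1.5, 2.0, 3.0]
h_s = min(grid, key=lambda h: bmise(base, h_in, h, resamples))       # Step 6
print(h_s, f0_hat(base, h_s))
```

Reusing the same resamples for every candidate h keeps the comparison across bandwidths free of extra Monte Carlo noise.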

Remark 2 . A referee suggested using a different noise distribution than 
the uniform in Step 2 above, specifically, that the noise distribution be pro- 
portional to the detection function. As our method is intended for applica- 
tions besides distance sampling, we decided not to pursue this here. 

4. Confidence intervals for f(0) and D. In this section we construct
bootstrap confidence intervals for f(0) based on the kernel density estimates.
Constructing confidence intervals for densities requires accounting for the
bias that is not captured in the naive bootstrap procedure. Hall (1992)
proposed two methods to account for the bias: explicit bias estimation and
undersmoothing. The former method involves estimating the leading term of
the bias explicitly to obtain a bias-adjusted bootstrap t-confidence interval.
The leading term of the bias is a functional of the second derivative of f
and Hall (1992) suggested using a plug-in kernel estimator of the derivative.
In the undersmoothing method, a sub-optimal bandwidth of a smaller order
than the optimal bandwidth is chosen to make
$[E(\hat{f}(x)) - f(x)]/\sqrt{\mathrm{Var}(\hat{f}(x))}$ negligible.

Assuming that the largest d for which the dth derivative
$f^{(d)}$ exists is known, Hall (1992) compared the two approaches and
recommended the undersmoothing method. However, in practice, the value
of d is usually unknown. Furthermore, there are no useful guidelines for the
choice of the plug-in kernel estimator of the derivative and the amount of
undersmoothing.

Thus, we propose using smoothed bootstrap to estimate the bias of the 
kernel density estimate and construct several confidence intervals based on 
our smoothed bootstrap procedure. These confidence intervals are based on 
studentized and nonstudentized pivot statistics. We use these confidence 
intervals in our simulation and case studies. 

To construct confidence intervals, we first follow Steps 1 to 7 in Section
3 to generate smoothed resamples. For the smoothed resample $X^S_b$, b = 1, ..., B,
let $\hat{f}^S_b(x; h_S)$ denote the kernel density estimate computed from that
resample with bandwidth $h_S$.

Define the pivot statistic $R^S_b(x) = \hat{f}^S_b(x; h_S) - \hat{f}(x; h_{in})$. If we let $r^S_\alpha$ denote
the $\alpha$ sample quantile of $(R^S_1, \ldots, R^S_B)$, then a $100(1 - \alpha)\%$ bootstrap
pivot confidence interval for f(x) is

$$\left(\hat{f}(x; h_S) - r^S_{1-\alpha/2},\ \hat{f}(x; h_S) - r^S_{\alpha/2}\right). \tag{4.1}$$

Faraway and Jhun (1990) used a similar pivot statistic to construct
simultaneous confidence bands for f.
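Given the B pivot values, the interval (4.1) only requires two sample quantiles. The sketch below uses a simple order-statistic definition of the sample quantile, which is one of several common conventions and an assumption on our part; the pivot values and the point estimate are hypothetical numbers chosen for illustration.

```python
import math

def sample_quantile(values, q):
    """Empirical q-quantile taken as the order statistic at position ceil(q * B)."""
    ordered = sorted(values)
    k = math.ceil(q * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

def pivot_interval(estimate, pivots, alpha=0.05):
    """Interval (estimate - r_{1 - alpha/2}, estimate - r_{alpha/2}) as in (4.1)."""
    upper_q = sample_quantile(pivots, 1.0 - alpha / 2.0)
    lower_q = sample_quantile(pivots, alpha / 2.0)
    return estimate - upper_q, estimate - lower_q

# Hypothetical pivots R_b, b = 1, ..., 10, and a hypothetical point estimate.
pivots = [-0.012, -0.008, -0.005, -0.003, -0.001, 0.000, 0.002, 0.004, 0.007, 0.011]
low, high = pivot_interval(0.100, pivots, alpha=0.2)
print(low, high)
```

Note that the upper quantile of the pivots determines the lower end point and vice versa, so a pivot distribution shifted away from zero moves the whole interval in the opposite direction, which is how the bias adjustment enters.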

An alternative is to construct confidence intervals based on a studentized 
version of the above pivot statistic. It is known that studentized confidence 
intervals are more accurate since these intervals are second-order accurate 
[Wasserman (2005)]. 

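With a suitable estimator $\hat{\sigma}^S_b(x)$ of the standard deviation $\sigma(x)$ of $\hat{f}(x)$, we
can use the studentized pivot

$$U^S_b(x) = \frac{\hat{f}^S_b(x; h_S) - \hat{f}(x; h_{in})}{\hat{\sigma}^S_b(x)},$$

yielding a $100(1 - \alpha)\%$ bootstrap studentized pivotal interval

$$\left(\hat{f}(x; h_S) - u^S_{1-\alpha/2}\,\hat{\sigma}(x),\ \hat{f}(x; h_S) - u^S_{\alpha/2}\,\hat{\sigma}(x)\right), \tag{4.2}$$

where $u^S_\alpha$ is the $\alpha$ sample quantile of $(U^S_1, \ldots, U^S_B)$. Please refer to the
Appendix for details on how to obtain the standard deviation estimates.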

To construct a confidence interval for D, we use equation (1.1). Here there
is additional variability in $\hat{D}$ due to n being random. Buckland et al. (2001)
showed that the standard error of $\hat{D}$ is given by

$$\hat{\sigma}_D = \hat{D}\sqrt{\frac{\widehat{\mathrm{Var}}(n)}{n^2} + \frac{\widehat{\mathrm{Var}}(\hat{f}(0))}{[\hat{f}(0)]^2}}.$$

If we follow the common practice of using the Poisson for the distribution
of n, Var(n) can be estimated by the value of n, and the above expression
simplifies to

$$\hat{\sigma}_D = \hat{D}\sqrt{\frac{1}{n} + \frac{\widehat{\mathrm{Var}}(\hat{f}(0))}{[\hat{f}(0)]^2}}.$$

We use this latter formula in our analyses and simulation study.
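The simplified formula is easy to evaluate directly. As a hedged check, the sketch below plugs in the rounded smoothed bootstrap values quoted in Section 6.1 (estimated density 82.14, n = 105, estimate of f(0) equal to 0.0751 with standard error 0.017); the function name is ours.

```python
import math

def se_density(d_hat, n, f0_hat, se_f0):
    """Standard error of the density estimate with Var(n) estimated by n (Poisson)."""
    return d_hat * math.sqrt(1.0 / n + (se_f0 / f0_hat) ** 2)

# Rounded values quoted in Section 6.1 for the smoothed bootstrap estimate.
se_d = se_density(d_hat=82.14, n=105, f0_hat=0.0751, se_f0=0.017)
print(round(se_d, 2))
```

The result is close to the standard error of 20.36 reported in Section 6.1; the inputs used here are rounded, so exact agreement is not expected. The second term under the square root dominates, so most of the uncertainty in the density estimate comes from estimating f(0) rather than from the randomness of n.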

Using the same approach of defining a studentized pivot statistic, we get
the following $100(1 - \alpha)\%$ confidence interval for D:

$$\left(\hat{D} - w^S_{1-\alpha/2}\,\hat{\sigma}_D,\ \hat{D} - w^S_{\alpha/2}\,\hat{\sigma}_D\right),$$

where $w^S_\alpha$ is the $\alpha$ sample quantile of the pivot statistics $W^S_1, \ldots, W^S_B$
computed from the bootstrap samples, and $\hat{\sigma}_D$ is the standard error of $\hat{D}$
given above. See the Appendix for details.

5. Case studies. We next look at two case studies, one involving a wooden 
stake data set and the other a survey of kangaroos in Australia. All compu- 
tation, including implementation of our smoothed bootstrap method, was 
done with the R statistical language [R Development Core Team (2008)]. 
The code will be available as supplemental material at the Annals of Ap- 
plied Statistics website. 

5.1. Stake data in Utah. We consider here a wooden stakes data set 
from Logan, Utah, which was also analyzed by Burnham, Anderson and
Laake (1980), Barabesi (2000) and Barabesi, Greco and Naddeo (2002). This
data set was collected as part of a larger study on line transect sampling. In 
particular, 150 wooden stakes were put within 20 m of a transect line in a 
meadow near Logan, Utah. The length of the transect line was 1000 m and 
the actual density of stakes was known to be D = 37.5 stakes/hectare. An 
observer walked along the transect line and searched visually for the stakes. 

Out of 150 stakes, 68 were observed. The actual perpendicular distances of 
the identified stakes from the transect line are given in Table 6 of Burnham, 
Anderson and Laake (1980). We notice that more than one stake is found at 
some distances. In an actual application these distances are not known, but 
are estimated by the observer. With ten distance categories with end points 
1,2,3,4,5,7,9,11,15,20, the data then consist of counts of 8,6,4,13,7,8, 
7,6,5,4 in the ten distance categories. Note that the intervals do not have 
the same length. 
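The grouped representation of the stake data can be written out explicitly. The sketch below lists the bin midpoints, widths and histogram heights on the density scale; the lower end point 0 of the first bin is our assumption, since only the upper end points are listed above.

```python
edges = [0, 1, 2, 3, 4, 5, 7, 9, 11, 15, 20]   # metres; 0 is the transect line
counts = [8, 6, 4, 13, 7, 8, 7, 6, 5, 4]

n = sum(counts)
widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
midpoints = [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]
# Histogram heights on the density scale: relative frequency divided by bin width.
heights = [c / (n * w) for c, w in zip(counts, widths)]

for v, w, c, ht in zip(midpoints, widths, counts, heights):
    print(f"midpoint {v:5.1f}  width {w:2d}  count {c:2d}  height {ht:.4f}")
```

Dividing by the bin width matters here because the widths range from 1 m to 5 m; raw counts would overstate the density in the wide outer bins.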

Figure 1 shows a histogram of the relative frequencies and kernel density 
estimates with bandwidths obtained using different selection methods. 

The density estimate with the smoothed bootstrap bandwidth seems a
better fit and also yields the estimate $\hat{f}(0)$ = 0.1033, or $\hat{D}$ = 35.11, which is
closer to the true density D = 37.5. For both grouped and ungrouped data,
cross-validation in R produced the warning message "minimum occurred at
one end of the range", and the lower bounds of the bandwidth range were
chosen as optimal bandwidths. With cross-validation
bandwidth selectors based on ungrouped data and grouped data, we found
$\hat{D}$ = 34.07 and $\hat{D}$ = 34.58, respectively.

Burnham, Anderson and Laake (1980) fit Fourier series models to the 
ungrouped and grouped data to obtain confidence intervals for D. As pointed 
out in Mack and Quang (1998), the Fourier series method requires specifying 
a horizon, the maximum sighting distance, which is not well defined for 
grouped data. 

Barabesi (2000) suggested a local likelihood method to make inference
for f(0), but the method is mainly developed for ungrouped data. Barabesi,
Greco and Naddeo (2002) used density estimation with local least squares
to obtain estimates for D, with bandwidth chosen using a plug-in method.
While their method can be used for grouped data, the resulting confidence
interval for f(0) does not account for the estimation bias.

Table 1 shows the confidence interval we constructed from the wood stake
data using the bootstrap method described in Section 4. For comparison,
we have also included in the table confidence intervals obtained from the
above-mentioned references. While all confidence intervals cover the true
value D = 37.5, there are interesting differences.

Fig. 1. Wooden stakes with kernel density estimates.

Table 1
Confidence intervals for D

Method                                                                           95% interval
Fourier series method with ungrouped data [Burnham, Anderson and Laake (1980)]   (32.28, 45.72)
Fourier series method with grouped data [Burnham, Anderson and Laake (1980)]     (23.95, 40.90)
Local likelihood [Barabesi (2000)]                                               (27.20, 52.09)
Local least squares [Barabesi, Greco and Naddeo (2002)]                          (22.13, 49.25)
Smoothed bootstrap                                                               (26.65, 45.57)

First note that the first confidence interval is based on the ungrouped 
data and, thus, it is the shortest. Information is lost when data are grouped, 
and it is expected that the other confidence intervals will not be as precise. 
The second confidence interval is based on the Fourier series method, applied 
to the grouped data. This is a parametric method based on the maximum 
likelihood estimator and the length of this confidence interval is shorter than 
other confidence intervals using the grouped data. The confidence interval 
on the third line is based on a method developed for ungrouped data but 
applied to the grouped data. Notice the much wider confidence interval 
obtained as a result. While the fourth confidence interval is valid, it fails 
to consider the estimation bias. Note that our confidence interval is shorter 
than other nonparametric confidence intervals. 

5.2. Kangaroo survey data from Australia. Southwell and Weaver (1993) 
compared various density estimation techniques for line transect data using 
a data set of kangaroo sightings collected at two locations in Australia, 
Wallaby Creek and Tidbinbilla Nature Reserve. 

The line-transect work was conducted in a 1.5 km² region in Wallaby
Creek and in a 0.2 km² region in Tidbinbilla Nature Reserve. At each site, a
grid of equally spaced parallel lines was marked, 100 m and 50 m apart,
respectively, at Wallaby Creek and Tidbinbilla Nature Reserve. An observation
session would consist of first randomly selecting a transect and a direction. 
An observer would traverse that transect, then another line transect 400 m 
(Wallaby Creek) or 200 m (Tidbinbilla) away, and so on, alternating the di- 
rection with each subsequent line transect. Each observation session would 
focus on a particular species, the eastern grey kangaroo (Macropus gigan- 
teus) or red-necked wallaby (M. rufogriseus) in Wallaby Creek and the red 
kangaroo (M. rufus) in Tidbinbilla.

The kangaroos at both locations were used to the presence of humans.
This allowed the line transects to be more closely spaced than would nor- 
mally be done. Furthermore, it is also then relatively straightforward to per- 
form a census of the kangaroo populations. Thus, the true kangaroo popula- 
tion sizes are known, and serve as a point of comparison for the line-transect 
estimation techniques. Here, we will only use their data on sightings of the 
eastern grey kangaroo in Wallaby Creek. 

Figure 2 shows a histogram of the eastern grey kangaroo data, together 
with kernel density estimates obtained using optimal bandwidths selected 
using cross-validation on the grouped data and using our smoothed boot- 
strap method. Also shown is a density estimate obtained using the Distance 
software program [Thomas et al. (2009)]. This density estimate was obtained 
from the model with a Uniform key function and polynomial adjustment to 
the tails. This model was selected from among the other alternatives using 
AIC as the criterion. 

The cross-validation approach yields a density that essentially has a peak
at every bin, while the density obtained with the Distance software program
suggests that too much smoothing may have been applied. The density
estimates obtained from the models with the next two smallest AIC values,
using hazard-rate and half-normal key functions with cosine adjustments,
also suggest over-smoothing (not shown). The density estimate based on
our smoothed bootstrap approach attains a better fit to the data, with a
good balance between smoothing and retaining the peaks.

Fig. 2. Eastern grey kangaroo in Wallaby Creek with kernel density estimates
(horizontal axis: distance; vertical axis: density estimate).

The true density D is known to be 44 animals per km². Using the smoothed
bootstrap approach, we obtained $\hat{D}$ = 43.71 and a 95% confidence interval
of (37.63, 50.51). For comparison, with the line transect estimate based on
the Uniform model with polynomial adjustment, we have $\hat{D}$ = 39.16 with
95% confidence interval (34.91, 43.94).

6. Simulation. This section contains two parts. Section 6.1 studies the
performance of the bandwidth selection procedure together with the
symmetrization technique for estimating f(0) using simulated line transect data.

As our method is applicable to areas beyond line transect sampling, it is 
of interest to explore its performance under a variety of settings. In Section 
6.2 we apply our bandwidth selection method to data generated from arti- 
ficially constructed densities. These densities are mixtures of normals and 
while such densities are not considered likely in real applications, they are 
nevertheless commonly used in the density estimation community to assess 
different methods. Here, the aim is to estimate the whole density function 
using the selected bandwidth. 

6.1. Simulation study 1. Here, we consider a simulated line transect data 
set that was generated by Buckland et al. (2001) for comparing various line 
transect data analyses. We briefly describe it below, referring the reader to 
Buckland et al. (2001) for more details. 

The data set was simulated so that the assumptions for line transect 
sampling hold. It was based on the context of line transect sampling using 
12 parallel line transects of varying lengths within a region of irregular shape. 

Table 2
Estimates of f(0) using smoothed bootstrap, cross-validation and several parametric
models fit by Buckland et al. (2001)

Method                   f(0) (per m)   D (per km²)
Smoothed bootstrap       0.0751         82.14
Cross-validation         0.0726         79.44
Uniform + cosine         0.0732         80.06
Uniform + polynomial     0.0681         74.43
Half-normal + Hermite    0.0794         86.87
Hazard-rate + cosine     0.0769         84.06
True value               0.0798         79.79

Fig. 3. Histogram of the simulated transect line sampling data together with kernel
density estimates with bandwidths selected using smoothed bootstrap and
cross-validation with grouped data.

The detection function used was the half-normal and the true values of f(0)
and D are 0.0798 m⁻¹ (to three significant figures) and 79.79 objects per
km². The model was set up so that the expected number of observations was
96. The simulated data set has 105 observations. The original data set was 
ungrouped, but was grouped in various ways by Buckland et al. (2001) for 
use with some of the methods considered there. We use the data which had 
been grouped into 20 groups of equal width. Figure 3 shows a histogram of 
the raw data. 
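As a quick check on the stated true value, recall the standard line transect relation f(0) = 1 / ∫ g(x) dx for a detection function g with g(0) = 1. For a half-normal g(x) = exp(-x^2 / (2 sigma^2)) this gives f(0) = sqrt(2/pi) / sigma. The sketch below assumes sigma = 10 m, a value that is not stated in the text but reproduces 0.0798 m$^{-1}$; the function names are illustrative only.

```python
import math

def half_normal_f0(sigma):
    """Closed form: f(0) = 1 / integral_0^inf exp(-x^2 / (2 sigma^2)) dx."""
    return math.sqrt(2.0 / math.pi) / sigma

def f0_by_quadrature(sigma, upper_mult=10.0, steps=100_000):
    """Midpoint-rule check of the same integral on [0, upper_mult * sigma]."""
    width = upper_mult * sigma / steps
    area = width * sum(
        math.exp(-(((k + 0.5) * width) ** 2) / (2.0 * sigma ** 2))
        for k in range(steps)
    )
    return 1.0 / area

SIGMA_M = 10.0  # assumed half-normal scale in metres (not given in the text)
print(round(half_normal_f0(SIGMA_M), 4))    # prints 0.0798
print(round(f0_by_quadrature(SIGMA_M), 4))  # numerical check of the closed form
```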

We applied smoothed bootstrap and cross-validation for grouped data to 
this data set. The resulting density estimates are shown in Figure 3. Estimates
of f(0) and D are shown in Table 2. This table also contains estimates
taken from Table 4.2 of Buckland et al. (2001), obtained using parametric 
models fit to the data. These involve fitting a key function (uniform, half- 
normal or hazard-rate) to the data and then applying an adjustment (cosine, 
polynomial or Hermite) to the tails. 

Since the true detection function is half-normal, it is not surprising that
the half-normal model (with Hermite adjustment) gave the estimate of f(0)
closest to the true value. The estimate obtained with smoothed bootstrap was
closer to the true value than the cross-validation estimate and the estimates
obtained using the uniform models. Note that the true value of D was computed
from formula (1.1) with the expected value E(n) = 96, whereas the estimates
use the observed n = 105.
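The relation between the two columns of Table 2 can be checked directly, assuming formula (1.1) has the form D = n f(0) / (2L) used in Appendix B. The total transect length L = 48 km below is not stated in this section; it is back-calculated from the true values (96 observations, f(0) = 0.0798 m$^{-1}$, D = 79.79 per km$^{2}$) and should be read as an assumption.

```python
def density_per_km2(f0_per_m, n, total_length_km):
    """D = n f(0) / (2 L), with f(0) in 1/m and L in km, returned per km^2."""
    length_m = total_length_km * 1000.0
    per_m2 = n * f0_per_m / (2.0 * length_m)
    return per_m2 * 1.0e6

L_KM = 48.0  # assumed total line length, inferred from the true values

rows = [
    ("Smoothed bootstrap", 0.0751, 105),
    ("Uniform + cosine", 0.0732, 105),
    ("True value", 0.0798, 96),
]
for name, f0, n in rows:
    print(name, round(density_per_km2(f0, n, L_KM), 2))
```

The true-value row comes out at 79.8 rather than 79.79 because f(0) is quoted to three significant figures only.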

We note also that the simulated data had an outlier and Buckland et
al. (2001) recommended truncating about 5% of the data, corresponding to
dropping the six largest observations in this case. After truncation, the
estimates of f(0) and D were 0.0844 and 87.98 using the half-normal model with
Hermite adjustment. Our method is nonparametric and, hence, we do not make
assumptions about the form of the density. In particular, our estimate of f(0)
is robust to outliers in the tails because the kernel estimate is based on
local smoothing. Hence, the presence of outliers does not adversely affect the
estimation of f(0) and our method does not require truncation.

Using the formulas in Section 4, we obtained standard errors of 0.017
and 20.36 for f(0) and D respectively, assuming the Poisson distribution
as the sampling distribution for n. Nominal 95% confidence intervals for
f(0) and D, obtained by bootstrap, were (0.061, 0.092) and (67.18, 101.33)
respectively.

With the normal approximation approach of Buckland et al. (2001), the 95%
confidence intervals for D are (60.14, 113.60) without truncation and
(59.36, 116.30) with truncation.

6.2. Simulation study 2. In this section we present results from a simu- 
lation study testing the effectiveness of our bandwidth selection method for 
estimating the whole density function from binned data, a special case of 
grouped data. 

We used four normal mixture densities taken from Marron and Wand
(1992). The parameters for the mixture densities are shown in Table 3 and
plots of these densities are shown in Figure 4 (solid lines). All simulation
studies were implemented in R. We generated a sample of size 500
from each of these densities and binned the data using a bin width of 0.25.
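The data-generation step can be sketched as follows for one of the four models, the claw density, using the Marron and Wand (1992) parameters. This is an illustration in Python rather than the authors' R code; the seed, the helper names and the choice to align bin edges at multiples of 0.25 are assumptions.

```python
import math
import random
from collections import Counter

# (weight, mean, sd) triples for the claw density of Marron and Wand (1992)
CLAW = [(0.5, 0.0, 1.0)] + [(0.1, k / 2.0 - 1.0, 0.1) for k in range(5)]

def sample_mixture(components, n, rng):
    """Draw n values: pick a component by weight, then draw from its normal."""
    weights = [w for w, _, _ in components]
    picks = rng.choices(components, weights=weights, k=n)
    return [rng.gauss(mu, sd) for _, mu, sd in picks]

def bin_data(xs, width):
    """Return {bin midpoint: count} for bins [k * width, (k + 1) * width)."""
    counts = Counter(math.floor(x / width) for x in xs)
    return {(k + 0.5) * width: c for k, c in sorted(counts.items())}

rng = random.Random(1)          # arbitrary seed for reproducibility
raw = sample_mixture(CLAW, 500, rng)
binned = bin_data(raw, 0.25)    # bin edges at multiples of 0.25 (assumed)
print(len(raw), len(binned))
```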

Thus, we have two data sets for each model, one raw and one binned. 
Optimal bandwidth selection using cross-validation was applied to each data 

Table 3

Parameters for mixture normal densities

Model  Density
1      Gaussian            $N(0,1)$
2      Separated bimodal   $\frac{1}{2}N(-\frac{3}{2},(\frac{1}{2})^2) + \frac{1}{2}N(\frac{3}{2},(\frac{1}{2})^2)$
3      Claw                $\frac{1}{2}N(0,1) + \sum_{k=0}^{4}\frac{1}{10}N(\frac{k}{2}-1,(\frac{1}{10})^2)$
4      Asymmetric claw     $\frac{1}{2}N(0,1) + \sum_{k=-2}^{2}\frac{2^{1-k}}{31}N(k+\frac{1}{2},(\frac{2^{-k}}{10})^2)$

Fig. 4. Plots showing the true densities of the 4 models we considered (solid lines) and
the kernel density estimates (dashed lines) using the cross-validation optimal bandwidths
obtained from the binned data, $h_{i,\mathrm{bin}}$.

set, yielding bandwidth values $h_{i,\mathrm{raw}}$ and $h_{i,\mathrm{bin}}$ for i = 1, ..., 4, which are the cross-
validation optimal bandwidths obtained from the ith raw data set and the ith
binned data set respectively. In R, this is done using the function bw.ucv.
Since this is an optimization problem, the function searches over a built-in
range of bandwidths. For all models we considered, applying the function
to the binned data sets yielded the warning message "minimum occurred at
one end of the range", suggesting that the optimal bandwidths found using
cross-validation are near 0.
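The behaviour behind this warning can be reproduced with a direct implementation of the unbiased cross-validation criterion for a Gaussian kernel. The sketch below is not R's bw.ucv; the sample size, seed and bandwidth grid are arbitrary choices. With binned data many pairwise distances are exactly zero, so the leave-one-out term dominates and the criterion keeps decreasing as h shrinks.

```python
import math
import random

def ucv(xs, h):
    """Unbiased cross-validation score, Gaussian kernel, bandwidth h."""
    n = len(xs)
    sq_sum = 0.0    # pair sum for the integral of fhat^2
    loo_sum = 0.0   # pair sum for the leave-one-out term
    for i in range(n):
        for j in range(i + 1, n):
            d2 = (xs[i] - xs[j]) ** 2
            sq_sum += math.exp(-d2 / (4.0 * h * h))
            loo_sum += math.exp(-d2 / (2.0 * h * h))
    int_f2 = (n + 2.0 * sq_sum) / (n * n * 2.0 * h * math.sqrt(math.pi))
    loo = 2.0 * loo_sum / (n * (n - 1) * h * math.sqrt(2.0 * math.pi))
    return int_f2 - 2.0 * loo

rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(150)]
binned = [(math.floor(x / 0.25) + 0.5) * 0.25 for x in raw]  # bin midpoints

grid = [0.02 * k for k in range(1, 31)]  # candidate bandwidths 0.02, ..., 0.60
h_raw = min(grid, key=lambda h: ucv(raw, h))
h_bin = min(grid, key=lambda h: ucv(binned, h))
print("raw:", h_raw, "binned:", h_bin)
```

For the binned sample the minimizer sits at the lower end of whatever grid is supplied, which is the situation the R warning reports.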
Fig. 5. Plots showing the true densities of the 4 models we considered (solid lines) and
the kernel density estimates (dashed lines) using the cross-validation optimal bandwidths
obtained from the original, raw data, $h_{i,\mathrm{raw}}$.



Each pair of selected bandwidths is then used with the binned data to
obtain kernel density estimates. The results are shown in Figures 4 and 5,
which are respectively plots of the kernel density estimates using $h_{i,\mathrm{bin}}$ and
$h_{i,\mathrm{raw}}$.
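To make the effect of the bandwidth on binned data concrete, the following sketch evaluates a Gaussian-kernel estimate that treats every bin midpoint as a set of tied observations. This is a simplification for illustration and is not necessarily the grouped-data estimator defined earlier in the paper; the bins, counts and bandwidths are hypothetical.

```python
import math

def kde_from_bins(x, midpoints, counts, h):
    """Gaussian-kernel estimate at x; each midpoint carries `count` observations."""
    n = sum(counts)
    total = sum(
        c * math.exp(-0.5 * ((x - m) / h) ** 2)
        for m, c in zip(midpoints, counts)
    )
    return total / (n * h * math.sqrt(2.0 * math.pi))

midpoints = [-0.375, -0.125, 0.125, 0.375]  # hypothetical bins of width 0.25
counts = [10, 40, 35, 15]                   # hypothetical counts

for h in (0.03, 0.15):
    values = [kde_from_bins(x, midpoints, counts, h) for x in (-0.125, 0.0, 0.125)]
    print(h, [round(v, 3) for v in values])
```

With the small bandwidth the estimate collapses into spikes at the midpoints and is close to zero between them, which is the under-smoothing pattern described for the bandwidths selected from binned data.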

Figure 4 shows the problem of using cross-validation on the binned data
to obtain optimal bandwidths. As can be seen, the selected bandwidths $h_{i,\mathrm{bin}}$
are too small, resulting in severe under-smoothing (dashed lines). In Figure
5 we find that if the underlying true density is relatively smooth (models 1
and 2), using the optimal bandwidths for the raw data, $h_{i,\mathrm{raw}}$, on the binned
Fig. 6. Plots showing the true densities of the 4 models we considered (solid lines) and
the kernel density estimates (dashed lines) using the initial bandwidths $h_{i,\mathrm{in}}$ obtained from
Step 4 of the smoothed bootstrap procedure.

data works well. However, if the true density is less smooth, using $h_{i,\mathrm{raw}}$ is
not appropriate for the binned data. Thus, methods such as that proposed
by Chiu (1991) that aim to obtain approximations to $h_{i,\mathrm{raw}}$ may not work if
the true density is not sufficiently smooth.

Figure 6 shows plots of kernel density estimates using the pilot bandwidths
$h_{i,\mathrm{in}}$ obtained from Step 4 of our procedure described in Section 3. These
plots are similar to those in Figure 5, with density estimates close to the 
true densities if the true densities are sufficiently smooth, but with severe 
under-smoothing otherwise. 
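For readers unfamiliar with the smoothed bootstrap, one draw from a Gaussian-kernel estimate built on binned data can be sketched as below: resample bin midpoints in proportion to their counts, add kernel noise scaled by a pilot bandwidth, then regroup into the same bins. This is a generic illustration, not a transcription of the procedure in Section 3; the regrouping step, the bins and the counts are assumptions, and the pilot value 0.154 is borrowed from Table 4 purely as an example.

```python
import math
import random
from collections import Counter

def smoothed_bootstrap_bins(midpoints, counts, h, width, rng):
    """One smoothed bootstrap sample from binned data, regrouped into bins."""
    n = sum(counts)
    centres = rng.choices(midpoints, weights=counts, k=n)  # ordinary resampling
    draws = [m + h * rng.gauss(0.0, 1.0) for m in centres]  # add kernel noise
    regrouped = Counter(math.floor(x / width) for x in draws)
    return {(k + 0.5) * width: c for k, c in sorted(regrouped.items())}

rng = random.Random(2)                       # arbitrary seed
midpoints = [-0.375, -0.125, 0.125, 0.375]   # hypothetical bins of width 0.25
counts = [10, 40, 35, 15]                    # hypothetical counts
sample = smoothed_bootstrap_bins(midpoints, counts, 0.154, 0.25, rng)
print(sample)
```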
Fig. 7. Plots showing the true densities of the 4 models we considered (solid lines) and the
kernel density estimates (dotted lines) using optimal bandwidths selected using smoothed
bootstrap, $h_{i,s}$. The dashed lines are density estimates using the cross-validation optimal
bandwidths from the original, raw data.

Plots of kernel density estimates using the smoothed bootstrap optimal
bandwidths $h_{i,s}$ are shown in Figure 7 (dotted lines). For comparison, the
kernel density estimates using $h_{i,\mathrm{raw}}$ with the raw data (the best case scenario)
are also shown as dashed lines. Note that the dotted lines are very close to
the dashed lines in spite of some information loss due to the binning. It is
clear that in all the models we considered, the resulting density estimates
are much smoother and closer to the true densities than those obtained using $h_{i,\mathrm{bin}}$, $h_{i,\mathrm{raw}}$ or
Table 4

Bandwidth comparisons for mixture normal densities

Model   CV with raw data   CV with binned data   Initial bandwidth   Smoothed bootstrap
1       0.316              0.034                 0.154               0.154
2       0.191              0.054                 0.148               0.144
3       0.058              0.028                 0.066               0.074
4       0.093              0.034                 0.101               0.112

$h_{i,\mathrm{in}}$ on the binned data. Table 4 summarizes the optimal bandwidth values
chosen by the different bandwidth selectors.

7. Concluding remarks. In this paper we introduced a combined cross-
validation and smoothed bootstrap approach for obtaining kernel density es-
timates from grouped data. Our simulation results show that the smoothing
parameter found using our method produces density estimates that match
the true density more closely than those from competing methods.

In line transect sampling it is the value of the density at the boundary,
specifically f(0), that is of interest, since the estimate of f(0) is used to
estimate the animal population density. We showed that the symmetrization
technique of Chen (1996) together with our bandwidth selection procedure
was able to produce good estimates of both the stake density and the eastern
grey kangaroo density.

There are some limitations to our method. For application to line transect
sampling, we are restricted to data sets that are sufficiently large and grouped
into about 10 intervals. With smaller data sizes, the data may be grouped into
as few as 3 or 4 intervals. In such cases, we do not expect a nonparametric
kernel method to work well. Often, a parametric model involving covariates
is used instead.

The methodology developed in this paper has wider potential application
in other scientific areas. For example, economists often want to make in-
ferences about income distributions in developing countries where only grouped
data are available to outsiders [Wu and Perloff (2007)]. In astronomy, Efron
and Tibshirani (1996) applied a semiparametric density estimator to estimate
the density of galaxies, where the data are counts on a fine grid.
Complex survey data are another possible application [Bellhouse and Stafford
(1999)]. We will explore some of these applications in future work.
APPENDIX A: ESTIMATION OF $\sigma^2(x)$



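We describe how to estimate the variance $\sigma^2(x)$ of $\hat f(x)$. It can be shown
that

$$
\sigma^2(x) = \frac{1}{n}\left[\frac{1}{h^2}\int K\left(\frac{x-y}{h}\right)^2 f(y)\,dy - \left\{\frac{1}{h}\int K\left(\frac{x-y}{h}\right) f(y)\,dy\right\}^2\right],
$$

and Hall (1992) proposed the following estimator of $\sigma^2(x)$:

$$
\hat\sigma^2(x) = \frac{1}{nh}\left[\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-X_i}{h}\right)^2 - h\hat f(x)^2\right].
$$

With our smoothed bootstrap samples, we can estimate the variance by

$$
\hat\sigma_b^2(x) = \frac{1}{nh_s}\left[\frac{1}{nh_s}\sum_{i=1}^{n} K\left(\frac{x-X^s_{i,b}}{h_s}\right)^2 - h_s\hat f^s_b(x)^2\right],
$$

and use $\hat\sigma_b(x)$ in the studentized pivot statistic $U^s_b$.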
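A plug-in variance estimator of the form attributed to Hall (1992), $\hat\sigma^2(x) = (nh)^{-1}[(nh)^{-1}\sum_i K((x-X_i)/h)^2 - h\hat f(x)^2]$, can be sketched as follows. The Gaussian kernel, the sample and the bandwidth are assumptions made for illustration. The estimator is algebraically the sample variance of the scaled kernel terms $K((x-X_i)/h)/h$ divided by n, which gives a convenient check.

```python
import math
import random
import statistics

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_and_variance(x, data, h):
    """Return (fhat(x), plug-in variance estimate of fhat(x))."""
    n = len(data)
    k = [gauss_kernel((x - xi) / h) for xi in data]
    f_hat = sum(k) / (n * h)
    var_hat = (sum(v * v for v in k) / (n * h) - h * f_hat ** 2) / (n * h)
    return f_hat, var_hat

rng = random.Random(3)                                    # arbitrary seed
data = [abs(rng.gauss(0.0, 10.0)) for _ in range(105)]    # hypothetical distances
f_hat, var_hat = kde_and_variance(5.0, data, 3.0)
print(f_hat, math.sqrt(var_hat))

# Same quantity via the population variance of the scaled kernel terms
terms = [gauss_kernel((5.0 - xi) / 3.0) / 3.0 for xi in data]
check = statistics.pvariance(terms) / len(data)
```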

APPENDIX B: CONFIDENCE INTERVAL FOR D 
Chen (1996) showed that

$$
\frac{\hat D - D - \mathrm{bias}(\hat D)}{\sigma_D} \to N(0,1),
$$

where $\mathrm{bias}(\hat D) = n f^{(2)}(0) h^2/(2L)$ and $f^{(2)}$ is the second derivative of $f$.

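Based on the same approach that we used to obtain a confidence interval
for f(0), we define a studentized pivot statistic:

$$
W^s_b = \frac{\hat D^s_b - \hat D_{h_{\mathrm{in}}}}{\hat\sigma_{b,D}},
$$

where

$$
\hat D_{h_{\mathrm{in}}} = \frac{n\hat f(0;h_{\mathrm{in}})}{2L}, \qquad \hat D^s_b = \frac{n\hat f^s_b(0;h_s)}{2L},
$$

and $\hat\sigma_{b,D}$ is the estimated standard error of $\hat D^s_b$, which combines $\mathrm{Var}(n)$ with the
variance of $\hat f^s_b(0;h_s)$.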



With B bootstrap samples, we get values $W^s_1, \ldots, W^s_B$. A $100(1-\alpha)\%$
confidence interval for D is then given by

$$
(\hat D - w^s_{1-\alpha/2}\hat\sigma_D,\ \hat D - w^s_{\alpha/2}\hat\sigma_D),
$$

where $w^s_\alpha$ is the $\alpha$ sample quantile of $(W^s_1, \ldots, W^s_B)$.
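The final step, turning B pivot values into an interval of the form (D - w_{1-alpha/2} sigma_D, D - w_{alpha/2} sigma_D), can be sketched as below. The linear-interpolation quantile rule is an assumption, since the text does not say which sample quantile definition is used, and the pivot values in the usage example are placeholders rather than results from the paper.

```python
import math
import random

def sample_quantile(values, p):
    """Sample quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def pivot_interval(d_hat, sigma_d, pivots, alpha=0.05):
    """Interval (d_hat - w_{1-alpha/2} * sigma_d, d_hat - w_{alpha/2} * sigma_d)."""
    upper_q = sample_quantile(pivots, 1.0 - alpha / 2.0)
    lower_q = sample_quantile(pivots, alpha / 2.0)
    return d_hat - upper_q * sigma_d, d_hat - lower_q * sigma_d

rng = random.Random(4)
placeholder_pivots = [rng.gauss(0.0, 1.0) for _ in range(999)]  # not real output
print(pivot_interval(82.14, 20.36, placeholder_pivots))
```

Note that the upper pivot quantile sets the lower endpoint and vice versa, so a skewed pivot distribution shifts the interval in the direction that corrects for bias.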
SUPPLEMENTARY MATERIAL

R codes for simulation and case studies (DOI: 10.1214/09-AOAS307SUPP;
.zip). This zip file contains two R scripts for the simulation and case studies
described in Jang and Loh (2009).